Classifying Factored Genres with Part-of-Speech Histograms
Abstract
This work addresses the problem of genre classification of text and speech transcripts, with the goal of handling genres not seen in training. Two frameworks employing different statistics on word/POS histograms with a PCA transform are examined: a single model for each genre and a factored representation of genre. The impact of the two frameworks on the classification of training-matched and new genres is discussed. Results show that the factored models allow for a finer-grained representation of genre and can more accurately characterize genres not seen in training.
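The feature pipeline named in the abstract (POS histograms followed by a PCA transform) can be illustrated with a minimal sketch. This is not the authors' implementation: the tag inventory `TAGS`, the toy documents, and the helper names `pos_histogram`, `fit_pca`, and `transform` are all assumptions made for illustration, and the sketch stops at the projected features without modeling either the single-model-per-genre or the factored classifier.

```python
from collections import Counter

import numpy as np

# Hypothetical tag inventory; the tag set used in the paper is not stated here.
TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "INTJ"]


def pos_histogram(tags, tagset=TAGS):
    """Relative frequency of each tag in one document (unknown tags are ignored)."""
    counts = Counter(t for t in tags if t in tagset)
    total = sum(counts.values())
    if total == 0:
        return np.zeros(len(tagset))
    return np.array([counts[t] / total for t in tagset])


def fit_pca(X, n_components):
    """Return (mean, components) from an SVD of the mean-centered data matrix."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def transform(X, mean, components):
    """Project histograms onto the retained principal directions."""
    return (X - mean) @ components.T


# Toy, already-tagged documents: two conversational, two written (assumed data).
docs = [
    ["PRON", "VERB", "INTJ", "PRON", "VERB", "ADV"],
    ["INTJ", "PRON", "VERB", "PRON", "ADV", "VERB"],
    ["DET", "ADJ", "NOUN", "VERB", "DET", "NOUN", "ADP", "NOUN"],
    ["DET", "NOUN", "ADP", "DET", "ADJ", "NOUN", "VERB", "NOUN"],
]

X = np.vstack([pos_histogram(d) for d in docs])   # one histogram per document
mean, components = fit_pca(X, n_components=2)
Z = transform(X, mean, components)                # low-dimensional genre features
```

In this toy data the two conversational documents share one histogram and the two written documents share another, so the first principal component separates the two groups. A real system would fit the PCA on training histograms only and apply the stored `mean` and `components` to unseen documents.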
Related resources
Features for factored language models for code-Switching speech
This paper presents investigations of features which can be used to predict Code-Switching speech. For this task, factored language models are applied and implemented into a state-of-the-art decoder. Different possible factors, such as words, part-of-speech tags, Brown word clusters, open class words and open class word clusters are explored. We find that Brown word clusters, part-of-speech tag...
Genre in Semantic Networks: A study of the Lexicon of News Articles
Our project aims at understanding text genres within the domain of the news. Advances in computational methods and the availability of digital corpora have ushered in a new age of empirically testing intuitions about genres and styles, in particular the automatic classification of a document by its genre. At the same time, identifying systematic patterns of difference between genres, both quantitati...
Factored Translation between Brazilian Portuguese and English
Factored translation is an extension of state-of-the-art phrase-based statistical machine translation (PB-SMT). The main difference in the factored translation approach is that a word is not only a token (its surface form) but a vector composed of different information such as lemma, part-of-speech or morphologic/syntactic tags. In this paper we present some experiments carried out to train and ...
Rescoring n-best lists for Russian speech recognition using factored language models
In this paper, we present a study of a factored language model (FLM) for rescoring N-best lists in a Russian speech recognition task. As a baseline language model we used a 3-gram language model. Both the baseline and factored language models were trained on a text corpus collected from recent news texts on Internet sites of online newspapers; the total size of the corpus is about 350 million words (2.4...
Speech Recognition on English-Mandarin Code-Switching Data using Factored Language Models - with Part-of-Speech Tags, Language ID and Code-Switch Point Probability as Factors
Code-switching (CS) is defined as "the alternate use of two or more languages in the same utterance or conversation" [1]. CS is a widespread phenomenon in multilingual communities, where multiple languages are concurrently used in a conversation. For automatic speech recognition (ASR), intra-sentential code-switching in particular poses an interesting challenge due to the multilingual context for la...